1 Introduction

Moving to a new city on a tight budget is challenging. A metropolis like London in particular has high rents and a competitive market, which makes it difficult to find accommodation with the right attributes at the right price. Sharing-economy services like Airbnb have facilitated the search for a spare room rented out by a private host. The available rooms and apartments are furnished, so the user can settle right in. But how do you know whether the price you are paying for your flat is actually fair?

Profits of both hosts and the platform itself have skyrocketed in recent years. A typical UK host earns around £3,000 a year (Cox, 2017a). This profit ultimately comes from the user, who pays both the platform’s fee and the host’s profit margin out of his or her own pocket. If you are on a tight budget yourself, you want to pay a price close to the market average for the attributes that matter to you. This paper aims to create a model that forecasts the price per night a user will pay for an Airbnb matching his or her requirements, making it easier to check whether the price of an apartment is indeed fair.

2 Description of the dataset

The dataset for this investigation covers all Airbnb offerings in London as of 4 and 5 March 2017. It contains 53,904 observations of 95 different variables. Its source is the website “Inside Airbnb - Adding data to the debate” (Cox, 2017b), an independent and non-commercial project that aims to examine the effect of Airbnb activities on urban development.

To allow this investigation to be more focused, the dataset was narrowed down. For example, only private rooms with at least three valid ratings were included. The resulting dataset has 6,495 observations for 78 variables and will be described in the following section.

2.1 Price

Table 1: Summary of Price Variable (GBP per night)
Min   Q1   Median   Mean    Q3   Max
8     35   45       50.07   59   590

Price is the dependent variable of our investigation. The summary statistics show that 75% of the listings are priced at £59 per night or less. However, there are some severe outliers that range up to a maximum of £590.

This raises concerns about the normality of its distribution. In fact, the left-hand plot in Figure 1 shows that the distribution is right-skewed rather than normal. The right-hand plot shows that the price looks almost normally distributed once a logarithmic scale is used.

Figure 1: Density of Price and Log10 of Price

2.2 Rent

With London being one of the most expensive cities to live in, rent can be considered a major cost of being a host on Airbnb. Therefore, we would like to observe the relationship between rent and the Airbnb price. However, the initial dataset holds no information on the regular rent at the location of an Airbnb. Fortunately, a website called “Find Properly” (see Lokku Ltd., 2017) uses data from Zoopla to provide the rent and selling price for each London region, broken down by postcode. Using the postcode, we were able to map the average weekly rent for 1-bed properties to every Airbnb. The matching was based on the first half of the postcode.

Mapping the mean rent and the logarithmically transformed Airbnb price by location reveals that the two variables are related. Nevertheless, it also becomes clear that there is more to an Airbnb price than just the average rent in the particular neighbourhood.
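The matching step can be sketched as follows. This is a minimal illustration rather than the code used for the report: the data frames `listings` and `rents`, the helper `outward_code` and all rent values are hypothetical, and it assumes that the "first half" of a postcode is the part before the space.

```r
# Hypothetical listings and rent lookup table; all values are made up.
listings <- data.frame(
  id = 1:3,
  postcode = c("SW7 2AZ", "e1 6an", "NW1 4RY"),
  stringsAsFactors = FALSE
)
rents <- data.frame(
  outward = c("SW7", "E1", "N1"),
  mean_rent = c(520, 390, 410),
  stringsAsFactors = FALSE
)

# Keep only the part of the postcode before the first whitespace.
outward_code <- function(postcode) {
  toupper(sub("\\s.*$", "", trimws(postcode)))
}

listings$outward <- outward_code(listings$postcode)

# Left join: listings without a matching rent keep mean_rent = NA.
matched <- merge(listings, rents, by = "outward", all.x = TRUE)
matched <- matched[order(matched$id), ]
```

Keeping unmatched listings (here the one in NW1) with a missing rent makes it easy to see how many observations would be lost before they are dropped from the regression.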

Figure 2: Mapping Rent Prices vs. Airbnb Prices

2.3 Location

Table 2: Correlation Test of Distance and Price
      P-Value         Conf Low    Estimate     Conf High
cor   4.802688e-260   -0.414006   -0.3944327   -0.3744948

When choosing an Airbnb in London, location is likely to be an important factor, as many users prefer to live close to the city centre. In our model, we use the distance to the touristic city centre - Piccadilly Circus - as a measure of the attractiveness of the Airbnb’s location. It was calculated using the Haversine formula (see Reid, 2011) and the geographic coordinates of Piccadilly Circus (Longitude: -0.133869, Latitude: 51.510067).
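A minimal sketch of the distance calculation is given below. The function name `haversine_km`, the Earth radius of 6371 km and the example listing coordinates are assumptions for illustration; only the coordinates of the city centre are taken from the text.

```r
# Great-circle distance in kilometres between two points given in degrees.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding errors that push sqrt(a) above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Coordinates of the city centre as stated in the text.
centre <- c(lon = -0.133869, lat = 51.510067)

# Distance for one hypothetical listing.
distance <- haversine_km(-0.0900, 51.5050, centre[["lon"]], centre[["lat"]])
```

Because the function only uses vectorised arithmetic, it can be applied to whole longitude and latitude columns at once.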

Figure 3: Price by Distance to the City Centre

The boxplot and the correlation test above show that the distance to the city centre and the price are significantly negatively correlated. Statistically speaking, the closer a property is to the city centre, the higher its price tends to be.

2.4 Reviews

In addition to written reviews, guests can give their hosts star ratings on the following parameters (see Airbnb Inc., 2017): overall experience, accuracy, cleanliness, communication, check in, location and value. Overall experience relates to the general impression of the guest and is only calculated for ads with at least three reviews. Accuracy asks how well the ad represented the real properties of the apartment. Cleanliness accounts for the tidiness of the flat. Check in and communication are both service-based: was communication with the host before and during the stay sufficient, and was the check-in process smooth or difficult? Location is evaluated based on the security, comfort and attractiveness of the neighbourhood. Finally, value is a subjective measure of whether guests believe that the apartment is worth the price paid - an interesting measure for our analysis.

While guests give their ratings on a one-to-five-star scale, the dataset transforms them to a rating from 1 to 10, and for the overall rating from 0 to 100. As the table below shows, the average scores are very high: 9 or 10 for the subcategories and 92 for the overall score. The lowest observed values lie between 2 and 4 for the subcategories and at 20 for the overall rating. Ads with good ratings are therefore overrepresented, which suggests that ads with bad reviews are unlikely to be booked and are eventually removed from the website. As the overall score is picked individually by the guest, the subcategories relate to it to different degrees. The overall score is only moderately correlated with location, communication and cleanliness (0.54 to 0.68), whereas accuracy, check in and value are strongly correlated with it (0.77 to 0.79). Transferring these findings to the analysis suggests that the latter variables may have a higher impact in the model, and it shows the need to analyse both the subcategories and the overall score, as they are given independently. The relation between the rating scores and price is weak: only location shows a weak correlation with price (0.30), while all other scores stay at or below 0.14.

Table 3: Summary of Review Scores
Name            Minimum   Maximum   Mean   Corr. with Overall   Corr. with Price
Accuracy        2         10        9      0.77                 0.09
Check In        2         10        10     0.78                 0.14
Cleanliness     2         10        9      0.67                 0.10
Communication   4         10        10     0.68                 0.10
Location        3         10        9      0.54                 0.30
Value           2         10        9      0.79                 0.04
Overall         20        100       92     1.00                 0.13

2.5 Property Characteristics

2.5.1 Accommodates and Beds

Table 4: Correlation of Accommodates and Beds with Log Price
Variable       P-Value         Conf Low    Estimate    Conf High
Accommodates   6.679629e-187   0.3168958   0.3377851   0.3583468
Beds           1.653387e-71    0.1885833   0.2110457   0.2332873

The variables accommodates and beds give an indication of the overall capacity of the Airbnb. Although the price mostly includes only one guest, a more spacious flat can be expected to be more expensive. Therefore, one may expect prices to increase with higher values of these variables.

A two-sided correlation test shows a significant, positive linear relationship with the logarithmic price for both variables.
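The following sketch shows how such a test could be run and summarised in the layout of the table above. It uses simulated data, so the variable values and the resulting numbers are illustrative only and do not reproduce the reported estimates; the use of a Pearson correlation is an assumption.

```r
set.seed(1)
n <- 200

# Simulated capacity and log price; the true relationship is positive.
accommodates <- sample(1:4, n, replace = TRUE)
price_log <- 3.5 + 0.15 * accommodates + rnorm(n, sd = 0.3)

# Two-sided Pearson correlation test with a 95% confidence interval.
test <- cor.test(accommodates, price_log,
                 alternative = "two.sided", method = "pearson")

result <- data.frame(
  variable  = "Accommodates",
  p_value   = test$p.value,
  conf_low  = test$conf.int[1],
  estimate  = unname(test$estimate),
  conf_high = test$conf.int[2]
)
```

Collecting the p-value, the estimate and the interval bounds in one row per variable makes it straightforward to stack the results for several variables with `rbind`.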

2.5.2 Amenities

Figure 4: Price by Amenity

Airbnb includes some general information on the property such as the room type, the number of people that can be accommodated or the number of bathrooms. On top of these characteristics, Airbnb contains information on a wide range of amenities for every flat. These range from the availability of Internet and a TV up to a personal doorman or a pool. In order to analyse these, we introduced dummy variables for 53 different amenities, with 46 resulting in usable data, as well as a variable counting the total number of amenities.

We examined seven amenities in more detail: home essentials such as kitchens, TVs, dryers and washers, facilities like elevators, whether the listing is family/kid friendly, and whether there is a lock on the bedroom door. Listings with a TV, an elevator, a dryer or a washer are priced higher than those without, especially in the case of TVs and dryers. In the table below, family/kid-friendly listings also have a higher mean log price, although this effect is no longer significant in the regression in Section 3. The market does not seem to value a kitchen: listings with one are marginally cheaper, but the difference is not statistically significant (p = 0.52). Another interesting finding is that rooms with a lock on the bedroom door are slightly cheaper than those without, possibly because rooms with a lock are more often located in less safe areas.

Table 5: Mean Log Price with and without Each Amenity
Amenity                 P-Value        Mean (with)   Mean (without)   Difference
Washer                  3.314698e-02   3.828518      3.802293         0.02622509
TV                      4.740326e-63   3.894721      3.736978         0.15774279
Family / Kid-Friendly   4.614522e-15   3.871364      3.792786         0.07857772
Dryer                   2.721383e-47   3.915857      3.769436         0.14642066
Kitchen                 5.209762e-01   3.822668      3.833094         -0.01042686
Elevator in Building    1.299341e-19   3.886817      3.794800         0.09201644
Lock on Bedroom Door    6.579133e-04   3.789237      3.830807         -0.04157048
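One way to obtain a row of such a comparison is a two-sample t-test on the log price. The sketch below assumes a Welch t-test, which is the default of `t.test` in R; the report does not state which test was used, and the data are simulated for illustration.

```r
set.seed(11)
n <- 300

# Simulated dummy variable for one amenity and a log price that depends on it.
has_tv <- rbinom(n, 1, 0.5) == 1
price_log <- 3.75 + 0.15 * has_tv + rnorm(n, sd = 0.35)

# Welch two-sample t-test: listings with the amenity (x) vs. without (y).
test <- t.test(price_log[has_tv], price_log[!has_tv])

amenity_row <- data.frame(
  amenity = "TV",
  p_val   = test$p.value,
  x_mean  = unname(test$estimate[1]),
  y_mean  = unname(test$estimate[2])
)
amenity_row$diff_mean <- amenity_row$x_mean - amenity_row$y_mean
```

A positive `diff_mean` means that listings with the amenity have the higher mean log price.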

2.6 Attributes of the ad

Table 6: P-Values for Attributes of the Ad
Attribute             P-Value
Instant Bookable      8.435e-01
Cancellation Policy   1.513e-08

In this part, we analyse attributes of the ad. We chose two variables that may intuitively influence the price: whether the listing is instantly bookable, and its cancellation policy.

To attract more customers, hosts sometimes allow their properties to be booked instantly. In our dataset, TRUE means guests can book the desired property instantly, while FALSE means they have to discuss their plans with the host and wait for approval before they can book. In addition, hosts have the right to choose their own cancellation policy, which decides whether and how guests are refunded. There are several cancellation policies from which hosts can choose, including flexible, moderate, strict and super strict. Under a flexible policy, guests may get a full refund if the reservation is cancelled in time, usually up to 24 hours prior to check in. Under a moderate policy, fees are also fully refundable, but the cancellation has to be made longer in advance. Under a strict policy, only 50% of the fees may be refunded, up until one week prior to check in. According to the table above, instant booking is not significantly related to price (p = 0.84), whereas the cancellation policy is (p < 0.001).

3 Regression model

Regression Results
Dependent variable: Price (log)
                                 (1)                          (2)
Mean Rent                        0.001*** (0.0001)            0.001*** (0.0001)
Distance                         -0.013*** (0.001)            -0.013*** (0.001)
Accommodates                     0.161*** (0.006)             0.151*** (0.005)
Review Score - Rating            0.002** (0.001)              0.002** (0.001)
Review Score - Cleanliness       0.027*** (0.006)             0.028*** (0.006)
Review Score - Location          0.066*** (0.006)             0.068*** (0.006)
Number of Beds                   -0.024** (0.010)
Amenity - Family friendly        0.007 (0.008)
Amenity - TV                     0.120*** (0.008)             0.117*** (0.008)
Amenity - Elevator               0.042*** (0.008)             0.043*** (0.008)
Amenity - Dryer                  0.072*** (0.008)             0.061*** (0.008)
Amenity - Kitchen                -0.028** (0.014)
Amenity - Washer                 -0.037*** (0.011)
Amenity - Lock on Bedroom Door   -0.045*** (0.010)
Instant bookable - FALSE         0.005 (0.009)
Cancellation Policy - Moderate   0.004 (0.010)
Cancellation Policy - Strict     -0.002 (0.009)
Constant                         2.037*** (0.060)             1.929*** (0.057)
Observations                     7,020                        7,020
R2                               0.423                        0.419
Adjusted R2                      0.422                        0.418
Residual Std. Error              0.306 (df = 7002)            0.307 (df = 7010)
F Statistic                      302.360*** (df = 17; 7002)   561.606*** (df = 9; 7010)
Note: *p<0.1; **p<0.05; ***p<0.01

3.1 Interpretation - Correction

As our dependent variable had to be transformed to its logarithmic version, a log-linear regression model is used to explain the effect of the independent variables on the dependent variable. Some of the amenities had a negative impact on the price, which is counterintuitive. As these effects are small and likely to be caused by random noise, such variables are excluded. Additionally, only variables that are significant in the regression are kept in the model:

\[\begin{aligned} \ln(price) = {} & \beta_0 + \beta_1(mean\_rent) + \beta_2(distance) + \beta_3(accommodates) \\ & + \beta_4(review\_scores\_rating) + \beta_5(review\_scores\_cleanliness) + \beta_6(review\_scores\_location) \\ & + \beta_7(TV) + \beta_8(elevator) + \beta_9(dryer) + u \end{aligned}\]

The presented regression model explains 41.8 percent of the variation in the log price. The residual standard error is 0.307 on the log scale, which corresponds to a multiplicative factor of exp(0.307) ≈ 1.36 rather than to an absolute error in GBP: a typical prediction is off from the real price by a factor of about 1.36. The F-statistic is highly significant, so the model provides a far better explanation than an intercept-only model.

The intercept of 1.929 corresponds to a price of exp(1.929) ≈ 6.88 GBP. However, no apartment has a rent of zero or accommodates no one, so the intercept has to be interpreted with care. The other coefficients indicate approximately by what percentage the price changes if \(x_i\) changes by one unit; the exact change is \(100 \cdot (e^{\beta_i} - 1)\) percent. For example, for every additional person a room can accommodate, the price rises by about 16 percent (\(e^{0.151} - 1 \approx 0.163\)). Capacity, the review score for location as an indicator of the attractiveness of the neighbourhood, and amenities like a TV, an elevator and a dryer have the largest positive effects on the price of a room.

The review score for value had to be taken out of the regression because of an inherent endogeneity problem: the price is a large factor in how guests rate value for money. The effect of distance is surprisingly small. This implies that distance to the city centre is not the best measure to account for geographic differences.
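To make the back-transformation from the log scale explicit, the short sketch below converts three values from model (2) of the regression table. It assumes that the dependent variable is the natural logarithm of the price, as written in the equation; the helper `pct_effect` is a name chosen here for illustration.

```r
# Values copied from model (2) in the regression table.
beta_accommodates <- 0.151
intercept <- 1.929
resid_se <- 0.307

# Exact percentage change in price for a one-unit increase in a regressor.
pct_effect <- function(beta) (exp(beta) - 1) * 100

effect_accommodates <- pct_effect(beta_accommodates)  # per additional guest
baseline_price <- exp(intercept)    # price in GBP when all regressors are zero
error_factor <- exp(resid_se)       # multiplicative size of one residual SE
```

For small coefficients the exact effect is close to 100 times the coefficient, which is why a coefficient of 0.013 can be read directly as roughly 1.3 percent.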

3.2 Fitting the model

Testing for multicollinearity reveals no critical correlations between the explanatory variables. A VIF of four implies that the variance of an estimator is four times higher than it would be if the independent variables were uncorrelated. Usually, a VIF greater than 5 is considered critical for the model results. None of the variables used reaches that threshold; the highest value is 3.10 for the overall review score. The Durbin-Watson statistic of 1.88 is close to 2 and the lag-1 autocorrelation of the residuals is only 0.06, so the errors are at most weakly correlated, as is also visible in the plot of residuals against predicted values. Note, however, that with a reported p-value of 0 the test formally rejects the hypothesis of no autocorrelation.

##                 mean_rent                  distance 
##                  1.782593                  1.681637 
##              accommodates      review_scores_rating 
##                  1.039792                  3.097386 
## review_scores_cleanliness    review_scores_location 
##                  2.519707                  1.614785 
##                   amen_TV amen_Elevator_in_building 
##                  1.073304                  1.024791 
##                amen_Dryer 
##                  1.055532
##  lag Autocorrelation D-W Statistic p-value
##    1      0.05957091      1.880332       0
##  Alternative hypothesis: rho != 0
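The VIF values above can be reproduced in principle by regressing each explanatory variable on all the others and computing 1 / (1 - R2). The following base R sketch illustrates this on simulated data; the function `vif_manual` and the variables `x1` to `x3` are hypothetical and not part of the original analysis.

```r
# VIF of each column: regress it on all other columns and use 1 / (1 - R^2).
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
n <- 500

# x2 is built to be correlated with x1; x3 is independent of both.
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)

predictors <- data.frame(x1, x2, x3)
vifs <- vif_manual(predictors)
```

In this simulation the correlated pair `x1` and `x2` receives noticeably higher values than the independent `x3`, which stays close to the minimum of 1.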

A short exploration of the residuals shows that they increase with rising prices. This implies that our model is worse at predicting the more expensive rooms, as the factors chosen do not fully explain the difference in price. The relation between residuals and prices may be explained by a factor the model could not quantify: the attractiveness of the room and the house it is in. As this attractiveness differs across buildings and sometimes even within a building, it is impossible to predict the price of an apartment that exceeds the expectations set by the base explanatory variables used in the regression.

Figure 5: Regression Residuals against Price

library(dplyr)
library(formattable)

# Split the listings by residual size (0.3 is about one residual standard error).
high_residuals <- data_short %>%
  select(price, price_log, residuals) %>%
  filter(residuals > 0.3)

low_residuals <- data_short %>%
  select(price, price_log, residuals) %>%
  filter(residuals <= 0.3)

# Mean price (log scale and GBP) in each group
descriptives_residuals <- data.frame(
  Name = c("Data with high residuals", "Data with low residuals"),
  Mean_Log_Price = c(round(mean(high_residuals$price_log), 2),
                     round(mean(low_residuals$price_log), 2)),
  Mean_Price = c(round(mean(high_residuals$price), 2),
                 round(mean(low_residuals$price), 2)))

formattable(descriptives_residuals)
Name Mean_Log_Price Mean_Price
Data with high residuals 4.36 84.63
Data with low residuals 3.73 44.11

3.3 Limitations of our model

Due to the scope of this assignment, we were not able to address every issue with our data. Several of these issues are discussed here.

3.3.1 Omitted variable bias

The price of an Airbnb is affected by a large number of factors. We built a model that includes some of them, but it was not feasible to include data on every possible determinant. As a result, our model likely suffers from omitted variable bias: it under- or overestimates the effect of some of the included factors to compensate for the missing information, which makes it less reliable. Our dataset lacks several important variables, such as the size of the room, the proximity of the flat to a tube station, the age of the flat and the quality of the equipment and furniture in the flat.

3.3.2 Multicollinearity

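The VIF values reported in Section 3.2 all lie below the critical value of 5, so multicollinearity does not critically affect the final model. The highest values belong to the overall review score (3.10) and the cleanliness score (2.52), which is in line with the correlations between the overall rating and its subcategories described in Section 2.4.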

3.3.3 Sensitivity to outliers

As in any regression model based on ordinary least squares, the coefficients in our model are affected by outliers. Some of the properties in our dataset cost more than £400 per night, while most of them cost less than £100. The outliers may have disproportionately affected our coefficients, making them less accurate for the remaining properties.

3.3.4 Non-linear relationships

Some of our explanatory variables (for example, the distance from the city centre) are not linearly related to the price of the property. There is a significant difference between the average price of a room located right in the city centre and one 5km away, while the difference between rooms located 25km and 30km away is not very large. This suggests that we could model the relationship more accurately if we used non-linear regression.

library(ggplot2)
ggplot(data = data_short, aes(x = distance, y = price)) + geom_smooth()
## `geom_smooth()` using method = 'gam'
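One simple way to allow for such a flattening relationship is to enter the distance in logarithmic form. The sketch below compares a linear and a log-distance specification on simulated data; the data-generating process and all numbers are assumptions for illustration, not results from the dataset.

```r
set.seed(7)
n <- 400

# Simulated listings whose log price falls steeply near the centre and
# flattens out further away.
distance <- runif(n, 0.5, 30)
price_log <- 4.3 - 0.25 * log(distance) + rnorm(n, sd = 0.2)

fit_linear <- lm(price_log ~ distance)
fit_log <- lm(price_log ~ log(distance))

# Share of variance explained by each specification.
r2 <- c(linear = summary(fit_linear)$r.squared,
        log = summary(fit_log)$r.squared)
```

On real data, the same comparison could be made with the adjusted R2 or an information criterion such as `AIC(fit_linear, fit_log)`.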

3.3.5 Lack of clustering

By putting all properties into one model, we ignore the fact that there might be different profiles of properties and for each profile, different characteristics might be relatively more important. Perhaps there is a set of properties that are popular with students coming to London for graduate job interviews, who would see location close to the financial centers and low price as important factors. And, perhaps, different types of properties are popular with middle-aged tourists - then the proximity to the popular sights and the level of comfort provided might matter more. If we divided our properties into clusters which share similar characteristics, and then ran a regression analysis for each cluster, we might get a more accurate model for each cluster.
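The idea can be sketched with k-means clustering followed by one regression per cluster. Everything in this example is hypothetical: the simulated `listings` data, the choice of two clusters and the clustering variables are assumptions that would have to be adapted to the actual dataset.

```r
set.seed(3)
n <- 300

# Simulated listings; all values are illustrative.
listings <- data.frame(
  distance = runif(n, 0, 20),
  accommodates = sample(1:4, n, replace = TRUE)
)
listings$price_log <- 4.2 - 0.02 * listings$distance +
  0.15 * listings$accommodates + rnorm(n, sd = 0.3)

# Cluster on standardised characteristics (k = 2 is an arbitrary choice).
features <- scale(listings[, c("distance", "accommodates")])
listings$cluster <- kmeans(features, centers = 2, nstart = 10)$cluster

# Fit the same log-linear model separately within each cluster.
fits <- lapply(split(listings, listings$cluster), function(d) {
  lm(price_log ~ distance + accommodates, data = d)
})
cluster_coefs <- sapply(fits, coef)
```

Comparing the columns of `cluster_coefs` shows whether a characteristic is priced differently in the two groups of properties.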

4 Conclusion

Despite the fact that the presented model has obvious limitations regarding factors that could not be quantified, it has direct implications for finding a reasonably priced apartment. Many properties that matter to someone searching for a room, like WiFi and a properly equipped kitchen, have small effects on the room price, as they are present in most London-based apartments. A traveller can therefore expect to find them. Luxury amenities like an elevator, a TV and a dryer come at a cost. Depending on the standards of the guest, these can be added if the budget allows. It is also good advice to check apartments in less attractive neighbourhoods to save money. Regarding cleanliness, a well-maintained room and flat will cost more. Looking at these different attributes of an Airbnb ad, the user is able to determine whether the price of the apartment is actually fair, which was the aim of this report. As especially high prices could not be explained by the model, a prediction is likely to return a base price rather than the price of a highly attractive room in a good apartment in a nice building.

Bibliography

Airbnb Inc. (2017) How do star ratings work. [Online]. Available from: https://de.airbnb.com/help/article/1257/how-do-star-ratings-work.

Cox, J. (2017a) Airbnb: Surge in UK hosts over past year boosts local economies. The Independent. [Online]. Available from: http://www.independent.co.uk/news/business/news/airbnb-hosts-uk-surge-boost-local-economies-online-holiday-rental-london-southwest-northern-ireland-a7940451.html.

Cox, M. (2017b) Inside Airbnb - Adding data to the debate. [Online]. Available from: http://data.insideairbnb.com/united-kingdom/england/london/2017-03-04/data/listings.csv.gz.

Lokku Ltd. (2017) London house prices by postcode. [Online]. Available from: https://www.findproperly.co.uk/london/postcode/#.WdvonHeZNn4.

Reid, M. (2011) Haversine formula. [Online]. Available from: http://wordpress.mrreid.org/2011/12/20/haversine-formula/.

Furthermore, for plotting our observations on a ggmap, we consulted the following sources:

Irawan, D.E. (2014) How to convert lat-long coordinates to UTM. [Online]. Available from: https://rpubs.com/dasaptaerwin/19879.

Lovelace, R. & Cheshire, J. (2014) Introduction to visualising spatial data in R. National Centre for Research Methods Working Papers. [Online] 14 (03). Available from: https://github.com/Robinlovelace/Creating-maps-in-R.

The header photo was downloaded from Pexels and is licence free. Available from: https://www.pexels.com/photo/architecture-buildings-business-capital-417382/

Imperial College Business School